Fall 2020 - DATS 6103 - Project 3 - Meron Metaferia

Movies and TV Shows Analysis: Exploration of Four Major Streaming Platforms


Introduction

There is a plethora of streaming services available to anyone with a credit card handy, and so many choices can make the space feel overcrowded for someone trying to pick a platform. In this project, I analyze a web-scraped dataset that I found on Kaggle to compare the four big streaming platforms: Netflix, Hulu, Prime Video, and Disney+. By the end of this analysis, you should have a better idea of which streaming service best suits your needs. Enjoy! 😃

**Targets for the Analysis**

  • Movie and TV Shows Distribution by Platform
  • Movies Category by Language and Country
  • Average Age Rating for Movies on each streaming platform
  • Movies distribution by genre
  • TV shows and Movies with highest IMDb ratings
  • Directors with highest and lowest IMDb ratings
  • Movie Availability - Interactive

**Data Source**

Dataset: data
Title Image: Image
That's all folks gif: gif

**1. Importing Packages**

In [182]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from plotly import graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
from ipywidgets import widgets

import warnings
warnings.filterwarnings('ignore')

**2. Reading Dataset**

In [183]:
mov = pd.read_csv('Movies.csv')
tv = pd.read_csv('tv_shows.csv')
In [184]:
mov.head()
Out[184]:
Unnamed: 0 ID Title Year Age IMDb Rotten Tomatoes Netflix Hulu Prime Video Disney+ Type Directors Genres Country Language Runtime
0 0 1 Inception 2010 13+ 8.8 87% 1 0 0 0 0 Christopher Nolan Action,Adventure,Sci-Fi,Thriller United States,United Kingdom English,Japanese,French 148.0
1 1 2 The Matrix 1999 18+ 8.7 87% 1 0 0 0 0 Lana Wachowski,Lilly Wachowski Action,Sci-Fi United States English 136.0
2 2 3 Avengers: Infinity War 2018 13+ 8.5 84% 1 0 0 0 0 Anthony Russo,Joe Russo Action,Adventure,Sci-Fi United States English 149.0
3 3 4 Back to the Future 1985 7+ 8.5 96% 1 0 0 0 0 Robert Zemeckis Adventure,Comedy,Sci-Fi United States English 116.0
4 4 5 The Good, the Bad and the Ugly 1966 18+ 8.8 97% 1 0 1 0 0 Sergio Leone Western Italy,Spain,West Germany Italian 161.0
In [185]:
#shape of movies df
mov.shape
Out[185]:
(16744, 17)
In [186]:
#columns in movies df
mov.columns
Out[186]:
Index(['Unnamed: 0', 'ID', 'Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes',
       'Netflix', 'Hulu', 'Prime Video', 'Disney+', 'Type', 'Directors',
       'Genres', 'Country', 'Language', 'Runtime'],
      dtype='object')
In [187]:
tv.head()
Out[187]:
Unnamed: 0 Title Year Age IMDb Rotten Tomatoes Netflix Hulu Prime Video Disney+ type
0 0 Breaking Bad 2008 18+ 9.5 96% 1 0 0 0 1
1 1 Stranger Things 2016 16+ 8.8 93% 1 0 0 0 1
2 2 Money Heist 2017 18+ 8.4 91% 1 0 0 0 1
3 3 Sherlock 2010 16+ 9.1 78% 1 0 0 0 1
4 4 Better Call Saul 2015 18+ 8.7 97% 1 0 0 0 1
In [188]:
#shape of tv shows df
tv.shape
Out[188]:
(5611, 11)
In [189]:
#a summary on movies dataset
mov.describe()
Out[189]:
Unnamed: 0 ID Year IMDb Netflix Hulu Prime Video Disney+ Type Runtime
count 16744.000000 16744.000000 16744.000000 16173.000000 16744.000000 16744.000000 16744.000000 16744.000000 16744.0 16152.000000
mean 8371.500000 8372.500000 2003.014035 5.902751 0.212613 0.053930 0.737817 0.033684 0.0 93.413447
std 4833.720789 4833.720789 20.674321 1.347867 0.409169 0.225886 0.439835 0.180419 0.0 28.219222
min 0.000000 1.000000 1902.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 1.000000
25% 4185.750000 4186.750000 2000.000000 5.100000 0.000000 0.000000 0.000000 0.000000 0.0 82.000000
50% 8371.500000 8372.500000 2012.000000 6.100000 0.000000 0.000000 1.000000 0.000000 0.0 92.000000
75% 12557.250000 12558.250000 2016.000000 6.900000 0.000000 0.000000 1.000000 0.000000 0.0 104.000000
max 16743.000000 16744.000000 2020.000000 9.300000 1.000000 1.000000 1.000000 1.000000 0.0 1256.000000
In [190]:
#a summary on tv shows dataset
tv.describe()
Out[190]:
Unnamed: 0 Year IMDb Netflix Hulu Prime Video Disney+ type
count 5611.000000 5611.000000 4450.000000 5611.000000 5611.000000 5611.000000 5611.000000 5611.0
mean 2805.000000 2011.021030 7.113258 0.344145 0.312600 0.382107 0.032080 1.0
std 1619.900511 11.005116 1.132060 0.475131 0.463594 0.485946 0.176228 0.0
min 0.000000 1901.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.0
25% 1402.500000 2010.000000 6.600000 0.000000 0.000000 0.000000 0.000000 1.0
50% 2805.000000 2015.000000 7.300000 0.000000 0.000000 0.000000 0.000000 1.0
75% 4207.500000 2017.000000 7.900000 1.000000 1.000000 1.000000 0.000000 1.0
max 5610.000000 2020.000000 9.600000 1.000000 1.000000 1.000000 1.000000 1.0

**3. Data Cleaning**

In [191]:
#Dropping the Unnamed: 0 column (ID is kept)
mov = mov.drop(['Unnamed: 0'], axis = 1)
mov.head()
Out[191]:
ID Title Year Age IMDb Rotten Tomatoes Netflix Hulu Prime Video Disney+ Type Directors Genres Country Language Runtime
0 1 Inception 2010 13+ 8.8 87% 1 0 0 0 0 Christopher Nolan Action,Adventure,Sci-Fi,Thriller United States,United Kingdom English,Japanese,French 148.0
1 2 The Matrix 1999 18+ 8.7 87% 1 0 0 0 0 Lana Wachowski,Lilly Wachowski Action,Sci-Fi United States English 136.0
2 3 Avengers: Infinity War 2018 13+ 8.5 84% 1 0 0 0 0 Anthony Russo,Joe Russo Action,Adventure,Sci-Fi United States English 149.0
3 4 Back to the Future 1985 7+ 8.5 96% 1 0 0 0 0 Robert Zemeckis Adventure,Comedy,Sci-Fi United States English 116.0
4 5 The Good, the Bad and the Ugly 1966 18+ 8.8 97% 1 0 1 0 0 Sergio Leone Western Italy,Spain,West Germany Italian 161.0
In [192]:
#Dropping Unnamed columns
tv = tv.drop(['Unnamed: 0'], axis = 1) 
tv.head()
Out[192]:
Title Year Age IMDb Rotten Tomatoes Netflix Hulu Prime Video Disney+ type
0 Breaking Bad 2008 18+ 9.5 96% 1 0 0 0 1
1 Stranger Things 2016 16+ 8.8 93% 1 0 0 0 1
2 Money Heist 2017 18+ 8.4 91% 1 0 0 0 1
3 Sherlock 2010 16+ 9.1 78% 1 0 0 0 1
4 Better Call Saul 2015 18+ 8.7 97% 1 0 0 0 1
In [193]:
#adding ID column for TV shows df
tv["ID"] = tv.index + 1
tv.head()
Out[193]:
Title Year Age IMDb Rotten Tomatoes Netflix Hulu Prime Video Disney+ type ID
0 Breaking Bad 2008 18+ 9.5 96% 1 0 0 0 1 1
1 Stranger Things 2016 16+ 8.8 93% 1 0 0 0 1 2
2 Money Heist 2017 18+ 8.4 91% 1 0 0 0 1 3
3 Sherlock 2010 16+ 9.1 78% 1 0 0 0 1 4
4 Better Call Saul 2015 18+ 8.7 97% 1 0 0 0 1 5

**4. Data Visualization**

TV Shows and Movies Distribution Across Platforms

In [13]:
def summ(df,b):
    return df[b].sum(axis=0)
In [14]:
#Counting the number of movies and Tv shows in each platform
counts = []
df = [mov,tv]
cols = ['Netflix','Hulu','Prime Video','Disney+']

for x in df:
    for y in cols:
        counts.append(summ(x,y))
In [15]:
counts
Out[15]:
[3560, 903, 12354, 564, 1931, 1754, 2144, 180]
In [16]:
#Setting Default values for the subplots
def pieplot(i,df,portion,title):
    '''Function to set default values to plot Platform 
       Movie Distribution for Movies and TV shows'''
    plt.subplot(i)
    plt.pie(portion, explode=explode, labels=labels, colors=colors, shadow = True, autopct='%1.1f%%')

    fig = plt.gcf()
    plt.title(title)
    plt.axis('equal')
In [17]:
#increase font size
import matplotlib as mpl
mpl.rcParams['font.size'] = 15.0
In [18]:
#plotting (create an empty figure; pieplot adds the two subplots itself)
fig = plt.figure(figsize=(17, 10))
labels = 'Netflix', 'Hulu','Prime Video','Disney+'
portion_m = [counts[0], counts[1],counts[2],counts[3]]
portion_t = [counts[4], counts[5],counts[6],counts[7]]
colors = ['r', 'springgreen', 'deepskyblue', 'darkblue']
explode = (0.1, 0, 0, 0) 

pieplot(121,mov,portion_m,'Movies')
pieplot(122,tv,portion_t,'TV shows')
plt.show()
  • Even though Netflix is often regarded as the leading streaming platform, with ~180m subscribers, Prime Video has the most content in both categories: 12,354 movies against Netflix's 3,560.
  • For TV shows, Netflix (1,931) and Prime Video (2,144) have similar catalog sizes, but Prime Video still leads.
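
The counts behind the pie charts can also be read as a small table. The following is a minimal sketch on a made-up sample frame (the titles and flags are illustrative, not taken from the dataset), and the helper name `platform_summary` is hypothetical. A title carried by several platforms is counted once per platform, so the shares are shares of listings, as in the pie charts.

```python
import pandas as pd

# Toy stand-in for the movies table; values are illustrative only
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Netflix": [1, 0, 1, 0],
    "Hulu": [0, 0, 1, 0],
    "Prime Video": [0, 1, 1, 1],
    "Disney+": [0, 0, 0, 0],
})

platforms = ["Netflix", "Hulu", "Prime Video", "Disney+"]


def platform_summary(df, cols):
    """Count titles per platform and give each count as a share of all listings."""
    counts = df[cols].sum()
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": shares})


summary = platform_summary(sample, platforms)
print(summary)
```

On the real data the same call would be `platform_summary(mov, platforms)` or `platform_summary(tv, platforms)`.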

Movie Count by Country and Language

Aggregating Movies dataset by language and counting the titles in each language

In [19]:
movie_count_by_language = mov.groupby('Language')['Title'].count().reset_index().sort_values('Title',ascending = False).head(10).rename(columns = {'Title':'Movie Count'})
fig = px.bar(movie_count_by_language, x='Language', y='Movie Count', color='Movie Count', title = 'Movie Count by Language', height=500)
fig.show()
  • English-language movies are by far the most common across the four platforms. This should come as no surprise, with Hollywood being the world's leading film industry.

Grouping the movies dataset by country and counting the titles in each country category

In [137]:
movies_by_country = mov.groupby('Country')['Title'].count().reset_index().sort_values('Title',ascending = False).head(10).rename(columns = {'Title':'MovieCount'})
fig = px.pie(movies_by_country,names='Country', values='MovieCount')
fig.update_traces(rotation=180, pull=[0.1,0.03,0.03,0.03,0.03],textinfo="percent")
fig.update_layout(showlegend=True, title_text = 'Movie Count by Country',font=dict(
        family="Courier New, monospace",
        size=18,
        color="black"))
#fig.update_layout(
    #font_family="Courier New",
    #font_color="black",
    #title_font_family="Times New Roman",
    #title_font_color="blue",
    #legend_title_font_color="black")

fig.show()
  • The USA is the largest producer of movies across the streaming platforms.

Movie Age Rating On Each Platform

In [21]:
#Counting movies in each age category
age_netflix = mov[mov.Netflix == 1].groupby(['Age', 'Netflix']).count()['ID'].reset_index()[['Age', 'ID']]
age_hulu = mov[mov.Hulu == 1].groupby(['Age', 'Hulu']).count()['ID'].reset_index()[['Age', 'ID']]
age_prime = mov[mov['Prime Video'] == 1].groupby(['Age', 'Prime Video']).count()['ID'].reset_index()[['Age', 'ID']]
age_disney = mov[mov['Disney+'] == 1].groupby(['Age', 'Disney+']).count()['ID'].reset_index()[['Age', 'ID']]
In [218]:
fig = go.Figure()
fig.update_layout(title_text = 'Movies Age Appropriateness')

fig.add_trace(go.Funnel(
     name = 'Netflix',
     y = age_netflix.Age,
     x = age_netflix['ID'],
     textinfo = "value",
     marker = {'color': 'red'}))

fig.add_trace(go.Funnel(
     name = 'Prime',
     orientation = 'h',
     y = age_prime.Age,
     x = age_prime['ID'],
     textposition = 'inside',
     textinfo = "value",
     marker = {'color': 'deepskyblue'}))

fig.add_trace(go.Funnel(
     name = 'Hulu',
     orientation = 'h',
     y = age_hulu.Age,
     x = age_hulu['ID'],
     textposition = 'inside',
     textinfo = "value",
     marker = {'color' : 'lime'}))

fig.add_trace(go.Funnel(
     name = 'Disney+',
     y = age_disney.Age,
     x = age_disney['ID'],
     textposition = 'outside',
     textinfo = "value",
     marker = {'color' : 'navy'}))

fig.show()
  • Age rating data in our dataset is incomplete, but the chart gives a rough idea of how movies are distributed across age categories.
  • Prime Video has the most content in each age group, followed by Netflix.

TV Shows Age Rating On Each Platform

In [23]:
#counting TV shows in each age category
age_n = tv[tv.Netflix == 1].groupby(['Age', 'Netflix']).count()['ID'].reset_index()[['Age', 'ID']]
age_h = tv[tv.Hulu == 1].groupby(['Age', 'Hulu']).count()['ID'].reset_index()[['Age', 'ID']]
age_p = tv[tv['Prime Video'] == 1].groupby(['Age', 'Prime Video']).count()['ID'].reset_index()[['Age', 'ID']]
age_d = tv[tv['Disney+'] == 1].groupby(['Age', 'Disney+']).count()['ID'].reset_index()[['Age', 'ID']]
In [24]:
# Each pie takes its labels from its own Age column, so a platform that
# lacks an age group still gets correctly labelled slices

# Define color sets of age groups
colors = ['deeppink', 'blue', 'yellow', 'navy', 'green']

# Create subplots, using 'domain' type for pie charts
specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]]

fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles = ['Netflix', 'Hulu', 'Prime Video', 'Disney+'])

# Define pie charts
fig.add_trace(go.Pie(labels=age_n['Age'], values=age_n['ID'], name='Netflix', hole = .3,
                     marker_colors=colors), 1, 1)
fig.add_trace(go.Pie(labels=age_h['Age'], values=age_h['ID'], name='Hulu', hole = .3,
                     marker_colors=colors), 1, 2)
fig.add_trace(go.Pie(labels=age_p['Age'], values=age_p['ID'], name='Prime Video', hole = .3,
                     marker_colors=colors), 2, 1)
fig.add_trace(go.Pie(labels=age_d['Age'], values=age_d['ID'], name='Disney+', hole = .3,
                     marker_colors=colors), 2, 2)


# Tune layout and hover info
fig.update_traces(hoverinfo='label+percent+name', textinfo='percent')

#title alignment 
fig.update_layout(title={'text':'TV Shows Age Rating', 'y':0.98, 'x':0.5, 'xanchor': 'center','yanchor': 'top'})

#to increase pie chart size
fig.update_layout(margin=dict(l=20, r=20, t=20, b=20, pad = 10))

#adding labels
fig.update(layout_showlegend=True)

fig = go.Figure(fig)
fig.show()

For TV shows, we can observe that:

    * Netflix has the most content for 16+ and the least for 13+
    * Hulu has the most content for 13+ and the least for the all category
    * Prime Video has the most content for 7+ and the least for 13+
    * Disney+ has the most content for 18+ and the least for the all category
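
A compact table is another way to read these counts. The sketch below builds a tiny sample frame (illustrative values, not the real data) and tabulates age ratings per platform, aligning on the age label so that a group missing from one platform shows up as 0. The helper name `age_table` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the TV shows table; values are illustrative only
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E", "F"],
    "Age": ["18+", "16+", "18+", "16+", "7+", None],
    "Netflix": [1, 1, 1, 1, 0, 0],
    "Disney+": [0, 0, 0, 0, 1, 1],
})


def age_table(df, platforms):
    """Rows are age groups, columns are platforms, cells are title counts."""
    per_platform = {
        p: df.loc[df[p] == 1, "Age"].value_counts() for p in platforms
    }
    # Building the frame from a dict of Series aligns them on the age label;
    # groups absent from a platform become NaN, which is turned into 0 here.
    return pd.DataFrame(per_platform).fillna(0).astype(int)


table = age_table(sample, ["Netflix", "Disney+"])
print(table)
```

Titles with a missing age rating are left out of the counts, which matches how `groupby` treats them in the cells above.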

Movies Count by Genre

In [25]:
genre_n = mov[mov.Netflix == 1].groupby(['Genres', 'Netflix']).count()['ID'].reset_index()[['Genres', 'ID']]
genre_h = mov[mov.Hulu == 1].groupby(['Genres', 'Hulu']).count()['ID'].reset_index()[['Genres', 'ID']]
genre_p = mov[mov['Prime Video'] == 1].groupby(['Genres', 'Prime Video']).count()['ID'].reset_index()[['Genres', 'ID']]
genre_d = mov[mov['Disney+'] == 1].groupby(['Genres', 'Disney+']).count()['ID'].reset_index()[['Genres', 'ID']]
In [232]:
# separating movies by streaming platform
# (== 1 builds a boolean mask; without it, .loc treats the 0/1 values as row labels)
netflix = mov.loc[mov['Netflix'] == 1]
hulu = mov.loc[mov['Hulu'] == 1]
prime_video = mov.loc[mov['Prime Video'] == 1]
disney = mov.loc[mov['Disney+'] == 1]
In [233]:
# dropping columns of other platforms and unnecessary columns

netflix = netflix.drop(['Hulu', 'Prime Video', 'Disney+', 'Type'], axis = 1)
hulu = hulu.drop(['Netflix', 'Prime Video', 'Disney+', 'Type'], axis = 1)
prime = prime_video.drop(['Hulu', 'Netflix', 'Disney+', 'Type'], axis = 1)
disney = disney.drop(['Hulu', 'Prime Video', 'Netflix', 'Type'], axis = 1)
In [234]:
# separating TV shows by streaming platform
# (== 1 builds a boolean mask; without it, .loc treats the 0/1 values as row labels)
net = tv.loc[tv['Netflix'] == 1]
hul = tv.loc[tv['Hulu'] == 1]
pri = tv.loc[tv['Prime Video'] == 1]
dis = tv.loc[tv['Disney+'] == 1]
In [235]:
# dropping columns for TV shows of other platforms and unnecessary columns

net = net.drop(['Hulu', 'Prime Video', 'Disney+'], axis = 1)
hul = hul.drop(['Netflix', 'Prime Video', 'Disney+'], axis = 1)
pri = pri.drop(['Hulu', 'Netflix', 'Disney+'], axis = 1)
dis = dis.drop(['Hulu', 'Prime Video', 'Netflix'], axis = 1)
In [236]:
def genre(df,title):
    genres_count = df.groupby('Genres', as_index = False).count()
    genres_count = genres_count[['Genres', 'ID']].rename({'ID' : 'Count'}, axis = 'columns')
    genres_count = genres_count.sort_values(by = 'Count', ascending = False)
    fig = px.bar(genres_count.head(15), y='Genres', x="Count", color='Genres', #showing only the top 15 genres
                 orientation="h", title=title)
    
    fig.update_layout(
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)',
        #title="Number of movies segmented by genre",
        xaxis_tickfont_size=14,
        yaxis=dict(
            title=dict(text='Movie genre', font=dict(size=16)),
            tickfont_size=14
        ),
        legend=dict(
            x=1,
            y=1.0,
            bgcolor='rgba(255, 255, 255, 0)',
            bordercolor='rgba(255, 255, 255, 0)'),)
    fig.show()
In [237]:
genre(netflix,'Netflix: Movie distribution by Genre')
  • On Netflix, Comedy is the most common movie genre.
In [238]:
genre(hulu,'Hulu: Movie distribution by Genre')
  • On Hulu, Documentary is the most common movie genre.
In [98]:
genre(prime,'Prime Video: Movie distribution by Genre')
  • For Prime Video, each bar counts one distinct genre combination (for example Action,Sci-Fi), so the chart ranks genre mixes rather than single genres.
In [99]:
genre(disney,'Disney+: Movie distribution by Genre')
  • For Disney+, the chart likewise ranks combined genre strings (for example Action,Adventure,Sci-Fi,Thriller) rather than single genres.
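
Because the Genres column stores several genres in one comma-separated string, the charts above count each combination as its own category. One possible alternative, sketched here, splits the string and counts every genre separately. The sample rows are the first Netflix titles shown in the dataset preview; the helper name `genre_counts` is hypothetical.

```python
import pandas as pd

# A few rows copied from the movies preview at the top of the notebook
sample = pd.DataFrame({
    "Title": ["Inception", "The Matrix", "Back to the Future"],
    "Netflix": [1, 1, 1],
    "Genres": [
        "Action,Adventure,Sci-Fi,Thriller",
        "Action,Sci-Fi",
        "Adventure,Comedy,Sci-Fi",
    ],
})


def genre_counts(df, platform):
    """Count individual genres for the titles available on one platform."""
    on_platform = df.loc[df[platform] == 1]
    genres = (
        on_platform["Genres"]
        .dropna()            # skip titles without genre information
        .str.split(",")      # "Action,Sci-Fi" -> ["Action", "Sci-Fi"]
        .explode()           # one row per (title, genre) pair
        .str.strip()
    )
    return genres.value_counts()


netflix_genres = genre_counts(sample, "Netflix")
print(netflix_genres)
```

A title with three genres contributes to three bars under this scheme, so the counts no longer sum to the number of titles.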

Movies with Highest IMDb Rating on Each Platform by Genre

  • These treemaps show the 15 movies with the highest IMDb ratings on each platform. Each treemap groups the movies by rating and shows the title, genre, and director name.
In [158]:
def treemap(df,platform,color,title):
    df=df.loc[df[platform] == 1]
    rate = df.sort_values(by='IMDb', ascending=False)
    rate = rate[0:15]
    rate['Movies']='Movies'
    fig = px.treemap(rate, path=['IMDb','Title', 'Genres','Directors'], 
                     color='IMDb',title = title, color_continuous_scale=color)#values='IMDb',
    fig.show()
In [159]:
treemap(mov,'Netflix','reds','Netflix')
treemap(mov,'Hulu','greens','Hulu')
treemap(mov,'Prime Video','blues','Prime Video')
treemap(mov,'Disney+','bupu','Disney+')

TV Shows with Highest Rating on Each Platform

  • These sunburst charts show the 15 TV shows with the highest IMDb ratings on each platform, together with the year each show was released.
In [38]:
def sun_t(df,platform,color,title):
    df=df.loc[df[platform] == 1]
    df=df.sort_values(by='IMDb', ascending=False)
    rate = df[0:15]
    fig =px.sunburst(rate,path=['Title','Year'],values='IMDb',color='IMDb',color_continuous_scale=color,title=title)
    fig.show()
In [39]:
sun_t(tv,'Netflix','hot', 'Netflix')
sun_t(tv,'Hulu','greens','Hulu')
sun_t(tv,'Prime Video','blues','Prime Video')
sun_t(tv,'Disney+','electric','Disney+')
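
The selection step inside the chart functions takes the first 15 rows after sorting by rating. A minimal sketch of the same selection as a standalone helper, with an optional rating threshold, might look like this. The sample rows come from the TV shows preview above; `top_rated` and its `min_rating` parameter are hypothetical names.

```python
import pandas as pd

# Rows copied from the TV shows preview at the top of the notebook
sample = pd.DataFrame({
    "Title": ["Breaking Bad", "Stranger Things", "Money Heist",
              "Sherlock", "Better Call Saul"],
    "IMDb": [9.5, 8.8, 8.4, 9.1, 8.7],
    "Netflix": [1, 1, 1, 1, 1],
})


def top_rated(df, platform, n=15, min_rating=None):
    """Return up to n titles on a platform, highest IMDb rating first."""
    on_platform = df.loc[df[platform] == 1]
    if min_rating is not None:
        # Optional cut-off, e.g. 8.5, applied before taking the top n
        on_platform = on_platform.loc[on_platform["IMDb"] >= min_rating]
    return on_platform.nlargest(n, "IMDb")


top3 = top_rated(sample, "Netflix", n=3)
strong = top_rated(sample, "Netflix", min_rating=8.5)
print(top3["Title"].tolist())
```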

Run Time of Movies with 9.3 IMDb Rating on all Platforms

  • Now let's say you want to watch the best-rated movies but also want to know how long they run. The following graph shows the runtimes of movies with an IMDb rating of 9.3. The times listed are in minutes. The longest movie in the graph is Bounty (the movie Mel Gibson stars in), with a runtime of 132 min (2h 12m).
In [249]:
runtime_top=mov.loc[mov['IMDb']==9.3][['Title','Runtime','IMDb']]
fig = px.bar(runtime_top, x='Title', y='Runtime', color='Runtime', height=500, 
             title='Runtime of Movies with 9.3 IMDb rating')
fig.show()
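
The Runtime column appears to be in minutes (its mean of about 93 matches a typical feature length). As a small worked example, a runtime in minutes can be turned into hours and minutes with `divmod`; the helper name `format_runtime` is hypothetical.

```python
def format_runtime(minutes):
    """Format a runtime given in minutes as 'Xh YYm'."""
    hours, mins = divmod(int(minutes), 60)
    return f"{hours}h {mins:02d}m"


print(format_runtime(132))  # 132 minutes is 2 hours and 12 minutes
print(format_runtime(93))   # the dataset's average runtime, rounded
```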

Movies and Directors

Which directors have the highest and lowest average IMDb ratings?

In [206]:
# Directors with the highest and lowest average IMDb rating across their movies

n = 10
x="Directors"

# Average only the IMDb column so that text columns are never passed to mean()
avg_imdb = mov.groupby(by="Directors")["IMDb"].mean().reset_index()
best = avg_imdb.sort_values(by="IMDb",ascending=False).iloc[:n]
worst = avg_imdb.sort_values(by="IMDb",ascending=True).iloc[2:n]

# For the worst IMDb average the worst two were dropped
# because they had an average rating of 0

fig = go.Figure(go.Funnelarea(
    text = best.Directors,
    values = best.IMDb,
    textinfo='value+text',
    showlegend=False,
    title = f"Top {n} Directors with the Highest Average IMDb Movie Ratings Across Platforms",
    ))

#to increase funnel chart size
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0,pad=0))
fig.update_layout(
    font=dict(
        family="Courier New, monospace",
        size=25,
        color="black"))

fig.show()
In [205]:
fig = go.Figure(go.Funnelarea(
    text = worst.Directors,
    values = worst.IMDb,
    textinfo='value+text',
    showlegend=False,
    title = f"{len(worst)} Directors with the Lowest Average IMDb Movie Ratings Across Platforms",
    ))

#to increase funnel chart size
fig.update_layout(margin=dict(l=0, r=0, t=0, b=0,pad=0))
fig.update_layout(
    font=dict(
        family="Courier New, monospace",
        size=25,
        color="black"))

fig.show()
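
An average over a single movie can put a director at the very top or bottom of such a ranking. One possible refinement, sketched below on made-up values, keeps only directors with a minimum number of rated movies before ranking. The helper `director_ratings` and its `min_movies` parameter are hypothetical, and a Directors value that lists several people is still treated as one group, as in the cells above.

```python
import pandas as pd

# Made-up ratings; director names are placeholders
sample = pd.DataFrame({
    "Directors": ["A", "A", "A", "B", "C", "C"],
    "IMDb": [8.0, 7.0, 9.0, 9.5, 6.0, None],
})


def director_ratings(df, min_movies=2):
    """Average IMDb rating per director, keeping directors with enough rated movies."""
    stats = df.groupby("Directors")["IMDb"].agg(movies="count", mean_imdb="mean")
    # "count" ignores missing ratings, so unrated movies do not help a director qualify
    enough = stats[stats["movies"] >= min_movies]
    return enough.sort_values("mean_imdb", ascending=False)


ranked = director_ratings(sample, min_movies=2)
print(ranked)
```

Here director B has the highest average but only one rated movie, so only A remains after filtering.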

Movie Availability

  • This is an interactive section: set the title variable in the cell below to the movie you want, then run the cell to see which streaming platforms carry it.
In [253]:
def is_movie_ava(title):
    
    N_ava=netflix.loc[netflix['Title']==(title)]
    P_ava=prime.loc[prime['Title']==(title)]
    D_ava=disney.loc[disney['Title']==(title)]
    H_ava=hulu.loc[hulu['Title']==(title)]
    
        
    if (len(N_ava) >0):
            print('It is available on Netflix! :)')
    else:
            print('Sorry, this movie is not Available on Netflix. :(')

    if (len(P_ava) >0):
            print('It is available on PrimeVideo! :)')
    else:
            print('Sorry, this movie not Available on PrimeVideo:(')

    if (len(D_ava) >0):
            print('It is available on Disney+! :)')
    else:
            print('Sorry, this movie is not Available on Disney+. :(')

    if (len(H_ava) >0):
            print('It is available on Hulu! :)')
    else:
            print('Sorry, this movie is not Available on Hulu. :(')

Input a movie title you want to watch

In [264]:
title='The Matrix' 
is_movie_ava(title)
It is available on Netflix! :)
Sorry, this movie not Available on PrimeVideo:(
Sorry, this movie is not Available on Disney+. :(
Sorry, this movie is not Available on Hulu. :(
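
The same lookup can be done directly on the platform columns, without building one frame per platform. This is a minimal sketch rather than a replacement for the function above: the sample rows are copied from the movies preview, `available_on` is a hypothetical helper, and the case-insensitive match is an added assumption about how titles should be compared.

```python
import pandas as pd

# Rows copied from the movies preview at the top of the notebook
sample = pd.DataFrame({
    "Title": ["Inception", "The Matrix", "The Good, the Bad and the Ugly"],
    "Netflix": [1, 1, 1],
    "Hulu": [0, 0, 0],
    "Prime Video": [0, 0, 1],
    "Disney+": [0, 0, 0],
})

PLATFORMS = ["Netflix", "Hulu", "Prime Video", "Disney+"]


def available_on(df, title):
    """Return the platforms that list a title (case-insensitive exact match)."""
    matches = df.loc[df["Title"].str.casefold() == title.casefold(), PLATFORMS]
    # A platform qualifies if any matching row has its flag set to 1
    return [p for p in PLATFORMS if matches[p].any()]


print(available_on(sample, "the matrix"))
print(available_on(sample, "The Good, the Bad and the Ugly"))
```

An unknown title returns an empty list, and the same function works for the TV shows frame because it has the same platform columns.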

TV Shows Availability

  • This is an interactive section: set the title variable in the cell below to the TV show you want, then run the cell to see which streaming platforms carry it.
In [242]:
# separating TV shows by streaming platform
# (== 1 builds a boolean mask; without it, .loc treats the 0/1 values as row labels)
net = tv.loc[tv['Netflix'] == 1]
hul = tv.loc[tv['Hulu'] == 1]
pri = tv.loc[tv['Prime Video'] == 1]
dis = tv.loc[tv['Disney+'] == 1]

# dropping columns for TV shows of other platforms and unnecessary columns

net = net.drop(['Hulu', 'Prime Video', 'Disney+'], axis = 1)
hul = hul.drop(['Netflix', 'Prime Video', 'Disney+'], axis = 1)
pri = pri.drop(['Hulu', 'Netflix', 'Disney+'], axis = 1)
dis = dis.drop(['Hulu', 'Prime Video', 'Netflix'], axis = 1)
In [259]:
def is_tvshow_ava(title):
    
    N_av=net.loc[net['Title']==(title)]
    P_av=pri.loc[pri['Title']==(title)]
    D_av=dis.loc[dis['Title']==(title)]
    H_av=hul.loc[hul['Title']==(title)]
    
        
    if (len(N_av) >0):
            print('It is available on Netflix! :)')
    else:
            print('Sorry, this TV Show is not Available on Netflix. :(')

    if (len(P_av) >0):
            print('It is available on PrimeVideo! :)')
    else:
            print('Sorry, this TV Show not Available on PrimeVideo:(')

    if (len(D_av) >0):
            print('It is available on Disney+! :)')
    else:
            print('Sorry, this TV show is not Available on Disney+. :(')

    if (len(H_av) >0):
            print('It is available on Hulu! :)')
    else:
            print('Sorry, this TV show is not Available on Hulu. :(')

Input a TV show title you want to watch

In [260]:
title='Breaking Bad'
is_tvshow_ava(title)
It is available on Netflix! :)
Sorry, this TV Show not Available on PrimeVideo:(
Sorry, this TV show is not Available on Disney+. :(
Sorry, this TV show is not Available on Hulu. :(

**5. Conclusion**

Movies Dataset Summary

  • Latest movie release date is 2020 and the oldest 1902.
  • Average Movie rating is 5.9
  • Average runtime is 93 min

TV Shows Dataset summary

  • Latest TV show release date is 2020 and the oldest 1901.
  • Average TV show rating is 7.1

**From the visualization:**

Most contents

           * Even though Netflix is often regarded as the leading streaming platform, with ~180m subscribers, Prime Video has the most content in both TV shows and movies.
           * For TV shows, Netflix and Prime Video have similar catalog sizes, but Prime Video still leads.
           * Across the platforms, most movies are in English and were made in the USA.

Age Appropriateness

           * Movies: Prime Video has the most content in each age group, followed by Netflix.
           * TV shows:
                    * Netflix has the most content for 16+ and the least for 13+
                    * Hulu has the most content for 13+ and the least for the all category
                    * Prime Video has the most content for 7+ and the least for 13+
                    * Disney+ has the most content for 18+ and the least for the all category

Movies by Genres

           * On Netflix, Comedy is the most common movie genre.
           * On Hulu, Documentary is the most common movie genre.
           * For Prime Video and Disney+, the genre charts rank combined genre strings (for example Action,Sci-Fi) rather than single genres.

Run time of Best Rated Movies(IMDb = 9.3)

           * Bounty was the lengthiest best rated movie with a runtime of 2h 12m.

Directors with an average IMDb rating above 9.0

           * Miguel Gaudêncio: known for Offside (Documentary: 2019)
           * Fen Tian: known for Love on a Leash (Comedy, Drama, Fantasy: 2011)
           * Danny Wu: known for Square One: Michael Jackson (Documentary: 2019)
           * Chris Leslie: known for A killer Men (Short, Action, Crime: 2015)
           * Oggi Tomic: known for Belongin The Truth Behind the Headlines (Documentary: 2017)
           * Rel Dowdell: known for Where's Daddy? (Documentary: 2017)
           * Paul Kakert: known for Escape from Firebase Kate (Documentary: 2015)


Improvements

           * The analysis would be more robust with a larger catalog of movies and TV shows, since many
           titles may still be missing, which affects the results. In addition, the Rotten Tomatoes
           and Age rating columns currently have many missing values; more complete values there
           would make the comparisons more reliable.
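
How much is missing in each column can be measured directly. The following is a minimal sketch on a made-up sample (the values are illustrative, not the dataset's actual gaps), and `missing_share` is a hypothetical helper name.

```python
import pandas as pd

# Toy frame with gaps; values are illustrative only
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Age": ["13+", None, None, "18+"],
    "Rotten Tomatoes": ["87%", None, None, None],
    "IMDb": [8.8, 8.7, None, 8.5],
})


def missing_share(df):
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)


report = missing_share(sample)
print(report)
```

Running `missing_share(mov)` and `missing_share(tv)` would show which columns limit the analysis most.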

GitHub and Zenodo Links

Learning Process

It was really fun working on this project. I learned a lot about managing a relatively big dataset and working with the interesting plots offered by plotly. In addition, I learned more about the big four streaming platforms. This was a very interesting and resourceful class.

Thank you, Prof. Zahadat!
